Conditional Similarity Networks
What makes images similar? To measure the similarity between images, they are
typically embedded in a feature-vector space in which their distances preserve
the relative dissimilarity. However, when learning such similarity embeddings,
the simplifying assumption is commonly made that images are only compared
according to one unique measure of similarity. A main reason for this is that contradicting
notions of similarities cannot be captured in a single space. To address this
shortcoming, we propose Conditional Similarity Networks (CSNs) that learn
embeddings differentiated into semantically distinct subspaces that capture the
different notions of similarities. CSNs jointly learn a disentangled embedding
where features for different similarities are encoded in separate dimensions as
well as masks that select and reweight relevant dimensions to induce a subspace
that encodes a specific similarity notion. We show that our approach learns
interpretable image representations with visually relevant semantic subspaces.
Further, when evaluating on triplet questions from multiple similarity notions
our model even outperforms the accuracy obtained by training individual
specialized networks for each notion separately.
Comment: CVPR 201
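The masking idea in the Conditional Similarity Networks abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a Euclidean distance computed after element-wise reweighting by a non-negative mask, and a standard margin-based triplet loss. The condition names, mask values, margin, and toy embeddings are all hypothetical.

```python
import math

def masked_distance(x, y, mask):
    """Euclidean distance after reweighting each dimension by a mask entry."""
    return math.sqrt(sum((m * (a - b)) ** 2 for a, b, m in zip(x, y, mask)))

def triplet_loss(anchor, positive, negative, mask, margin=0.2):
    """Hinge loss asking the positive to be closer than the negative
    by at least `margin`, measured inside the masked subspace."""
    d_pos = masked_distance(anchor, positive, mask)
    d_neg = masked_distance(anchor, negative, mask)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical masks: dims 0-1 encode "color", dims 2-3 encode "shape".
# In a CSN these masks would be learned jointly with the embedding.
masks = {
    "color": [1.0, 1.0, 0.0, 0.0],
    "shape": [0.0, 0.0, 1.0, 1.0],
}

a = [0.0, 0.0, 1.0, 1.0]
b = [0.1, 0.0, 5.0, 5.0]  # close to `a` in color, far in shape
c = [3.0, 3.0, 1.0, 1.0]  # far from `a` in color, identical in shape

color_loss = triplet_loss(a, b, c, masks["color"])  # triplet satisfied
shape_loss = triplet_loss(a, b, c, masks["shape"])  # triplet violated
```

The same triplet is satisfied under one mask and violated under the other, which is the kind of contradicting similarity notion that a single unmasked space cannot represent.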
Probabilistic Meta-Representations Of Neural Networks
Existing Bayesian treatments of neural networks are typically characterized
by weak prior and approximate posterior distributions according to which all
the weights are drawn independently. Here, we consider a richer prior
distribution in which units in the network are represented by latent variables,
and the weights between units are drawn conditionally on the values of the
collection of those variables. This allows rich correlations between related
weights, and can be seen as realizing a function prior with a Bayesian
complexity regularizer ensuring simple solutions. We illustrate the resulting
meta-representations and representations, elucidating the power of this prior.
Comment: presented at UAI 2018 Uncertainty In Deep Learning Workshop (UDL AUG.
2018
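To make the conditional weight prior in the abstract above concrete, here is a minimal sketch in which every unit carries a latent vector and each weight is drawn from a Gaussian whose mean depends on the latents of the two units it connects. The dot-product mean function, the noise scale, and all names are assumptions chosen for illustration; the abstract does not specify the form of the conditional distribution.

```python
import random

def sample_layer_weights(z_in, z_out, rng, noise=0.1):
    """Draw a weight matrix where w[i][j] ~ Normal(mean(z_in[i], z_out[j]), noise).

    The mean is a plain dot product of the two unit latents, which is a
    hypothetical stand-in for whatever mapping a real model would use.
    """
    weights = []
    for zi in z_in:
        row = []
        for zo in z_out:
            mean = sum(a * b for a, b in zip(zi, zo))
            row.append(rng.gauss(mean, noise))
        weights.append(row)
    return weights

rng = random.Random(0)
# Two input units share the same latent, so their outgoing weights share a mean.
z_in = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z_out = [[0.5, 0.5], [-1.0, 2.0]]

noiseless = sample_layer_weights(z_in, z_out, rng, noise=0.0)
noisy = sample_layer_weights(z_in, z_out, rng, noise=0.1)
```

Because weights are tied through the unit latents rather than drawn independently, units with similar latents end up with correlated weights.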
A Generative Model of Words and Relationships from Multiple Sources
Neural language models are a powerful tool to embed words into semantic
vector spaces. However, learning such models generally relies on the
availability of abundant and diverse training examples. In highly specialised
domains this requirement may not be met due to difficulties in obtaining a
large corpus, or the limited range of expression in average use. Such domains
may encode prior knowledge about entities in a knowledge base or ontology. We
propose a generative model which integrates evidence from diverse data sources,
enabling the sharing of semantic information. We achieve this by generalising
the concept of co-occurrence from distributional semantics to include other
relationships between entities or words, which we model as affine
transformations on the embedding space. We demonstrate the effectiveness of
this approach by outperforming recent models on a link prediction task and
showing its ability to profit from partially or fully unobserved
training labels. We further demonstrate the usefulness of learning from
different data sources with overlapping vocabularies.
Comment: 8 pages, 5 figures; incorporated feedback from reviewers; to appear
in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence
201
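The idea of modelling a relationship as an affine transformation on the embedding space can be sketched as follows. This is an illustrative reading of the abstract above, not the paper's model: the dot-product score, the softmax normalisation over candidate targets, the toy vocabulary, and the relation named `swap` are all assumptions.

```python
import math

def affine(matrix, offset, vec):
    """Apply v -> matrix @ v + offset using plain lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(matrix, offset)]

def target_probabilities(source, relation, embeddings):
    """Distribution over target words given a source word and a relation.

    The source embedding is moved by the relation's affine map, then
    compared with every candidate target by dot product (an assumption).
    """
    matrix, offset = relation
    moved = affine(matrix, offset, embeddings[source])
    scores = {w: sum(a * b for a, b in zip(moved, e))
              for w, e in embeddings.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Hypothetical two-dimensional embeddings and relations.
embeddings = {"alpha": [1.0, 0.0], "beta": [0.0, 1.0], "gamma": [1.0, 0.2]}
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])  # plain co-occurrence
swap = ([[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0])      # some other relationship

probs = target_probabilities("alpha", swap, embeddings)
best = max(probs, key=probs.get)
```

With the identity map the score reduces to ordinary embedding similarity, so co-occurrence becomes one relationship among several that share the same word vectors.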
Channel Vision Transformers: An Image Is Worth C x 16 x 16 Words
Vision Transformer (ViT) has emerged as a powerful architecture in the realm
of modern computer vision. However, its application in certain imaging fields,
such as microscopy and satellite imaging, presents unique challenges. In these
domains, images often contain multiple channels, each carrying semantically
distinct and independent information. Furthermore, the model must demonstrate
robustness to sparsity in input channels, as they may not be densely available
during training or testing. In this paper, we propose a modification to the ViT
architecture that enhances reasoning across the input channels and introduce
Hierarchical Channel Sampling (HCS) as an additional regularization technique
to ensure robustness when only partial channels are presented during test time.
Our proposed model, ChannelViT, constructs patch tokens independently from each
input channel and utilizes a learnable channel embedding that is added to the
patch tokens, similar to positional embeddings. We evaluate the performance of
ChannelViT on ImageNet, JUMP-CP (microscopy cell imaging), and So2Sat
(satellite imaging). Our results show that ChannelViT outperforms ViT on
classification tasks and generalizes well, even when a subset of input channels
is used during testing. Across our experiments, HCS proves to be a powerful
regularizer, independent of the architecture employed, suggesting itself as a
straightforward technique for robust ViT training. Lastly, we find that
ChannelViT generalizes effectively even when there is limited access to all
channels during training, highlighting its potential for multi-channel imaging
under real-world conditions with sparse sensors. Our code is available at
https://github.com/insitro/ChannelViT
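The token construction described in the ChannelViT abstract can be sketched without any deep learning library. This is a simplified illustration and not the code in the repository linked above: it skips the learned linear projection and uses flattened raw patches as tokens, it assumes the image height and width are divisible by the patch size, and `sample_channels` is only one plausible reading of Hierarchical Channel Sampling (first draw how many channels to keep, then draw which ones). All function names and toy values are hypothetical.

```python
import random

def channel_tokens(image, patch, channel_emb, pos_emb, channels=None):
    """Build one token per (channel, patch) pair.

    image: nested lists of shape C x H x W.
    Each token is the flattened patch plus a per-channel embedding and a
    per-position embedding, both of length patch * patch.
    """
    height, width = len(image[0]), len(image[0][0])
    if channels is None:
        channels = range(len(image))
    tokens = []
    for c in channels:
        pos = 0
        for top in range(0, height, patch):
            for left in range(0, width, patch):
                flat = [image[c][top + i][left + j]
                        for i in range(patch) for j in range(patch)]
                tokens.append([x + ce + pe for x, ce, pe
                               in zip(flat, channel_emb[c], pos_emb[pos])])
                pos += 1
    return tokens

def sample_channels(num_channels, rng):
    """Assumed two-stage sampling: pick a count, then pick that many channels."""
    count = rng.randint(1, num_channels)
    return sorted(rng.sample(range(num_channels), count))

# Toy 3-channel 4x4 image with 2x2 patches: 4 patches per channel.
image = [[[float(c)] * 4 for _ in range(4)] for c in range(3)]
channel_emb = [[0.1 * c] * 4 for c in range(3)]
pos_emb = [[0.01 * p] * 4 for p in range(4)]

all_tokens = channel_tokens(image, 2, channel_emb, pos_emb)
subset = sample_channels(3, random.Random(0))
subset_tokens = channel_tokens(image, 2, channel_emb, pos_emb, subset)
```

Since tokens are built per channel, dropping a channel only shortens the token sequence, so the same model can process any subset of channels.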